We are running three classifiers and blending their results:
mix_lbp.csv contains all the features used for the experiments (as well as for the final competition submission), one row per sample of the training set. For this demo, we split this set 9:1: 90% is used for training and 10% for validation.
We then run three different classifiers on these sets (training on the larger one and validating on the smaller one) and blend their predictions by simple weighted voting, in order to decrease the resulting log loss (computed on the validation set).
We are training with two types of features, all of which are contained in this file.
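The blend itself is conceptually simple: a weighted average of the per-class probability matrices each classifier produces. Here is a minimal sketch of the idea; the actual implementation is the vote() function imported from tr_utils below, and vote_sketch is just a hypothetical stand-in:

import numpy as np

# Toy weighted "soft voting": a weighted average of the per-class probability
# matrices, renormalized so each row sums to 1 again. Illustration only,
# the real logic lives in tr_utils.vote.
def vote_sketch(probas, weights):
    blended = sum(w * p for w, p in zip(weights, probas))
    return blended / blended.sum(axis = 1)[:, None]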
In [1]:
from SupervisedLearning import SKSupervisedLearning
from train_files import TrainFiles
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, confusion_matrix
from sklearn.calibration import CalibratedClassifierCV
from tr_utils import vote
import matplotlib.pylab as plt
import numpy as np
from train_nn import createDataSets, train
Let's define a function that plots the confusion matrix, so we can see how accurate our predictions really are.
In [2]:
def plot_confusion(sl):
    conf_mat = confusion_matrix(sl.Y_test, sl.clf.predict(sl.X_test_scaled)).astype(dtype='float')
    # normalize each row so it sums to 1
    norm_conf_mat = conf_mat / conf_mat.sum(axis = 1)[:, None]

    fig = plt.figure()
    plt.clf()
    ax = fig.add_subplot(111)
    ax.set_aspect(1)
    res = ax.imshow(norm_conf_mat, cmap=plt.cm.jet, interpolation='nearest')
    cb = fig.colorbar(res)

    # class labels start at 1, tick positions at 0
    labs = np.unique(sl.Y_test)
    x = labs - 1
    plt.xticks(x, labs)
    plt.yticks(x, labs)

    # annotate each cell with its percentage
    for i in x:
        for j in x:
            ax.text(i - 0.2, j + 0.2, "{:3.0f}".format(norm_conf_mat[j, i] * 100.))
    return conf_mat
In [3]:
train_path_mix = "./mix_lbp.csv"
labels_file = "./trainLabels.csv"
X, Y_train, Xt, Y_test = TrainFiles.from_csv(train_path_mix, test_size = 0.1)
The last line above reads the features file and splits it 9:1 into a training set (X, Y_train) and a held-out validation set (Xt, Y_test).
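If you do not have TrainFiles handy, a roughly equivalent split can be done with sklearn directly. This is only a sketch: it assumes the features come first and the class label sits in the last column of the CSV, which may not match the actual file layout.

import numpy as np
from sklearn.cross_validation import train_test_split

# load the CSV and do the same 9:1 split (label-in-last-column is an assumption)
data = np.loadtxt(train_path_mix, delimiter = ",")
X, Xt, Y_train, Y_test = train_test_split(data[:, :-1], data[:, -1], test_size = 0.1)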
Training consists of fitting three models: an SVM, a feed-forward neural network, and a calibrated random forest.
We neatly wrap this into our $\color{green}{SKSupervisedLearning}$ class. The procedure is simple:
In [4]:
sl = SKSupervisedLearning(SVC, X, Y_train, Xt, Y_test)
sl.fit_standard_scaler()
sl.train_params = {'C': 100, 'gamma': 0.01, 'probability' : True}
ll_trn, ll_tst = sl.fit_and_validate()
print "SVC log loss: ", ll_tst
You can play with the parameters here to see how log loss changes. SKSupervisedLearning wraps the sklearn grid search for optimal parameters in a single call; you can take a look at the implementation details.
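What such a search boils down to is a hedged sketch like the following, using sklearn's GridSearchCV directly (the grid values are illustrative, not what was actually searched):

from sklearn.grid_search import GridSearchCV

# illustrative grid search over the SVC parameters used above
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [0.1, 0.01, 0.001]}
gs = GridSearchCV(SVC(probability = True), param_grid, scoring = 'log_loss', cv = 3)
gs.fit(sl.X_train_scaled, Y_train)
print "best parameters: ", gs.best_params_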
Let's plot the confusion matrix to see how well we are doing (the values inside the squares are percentages). Change the magic below to %matplotlib qt to get an out-of-browser graph.
In [5]:
%matplotlib inline
conf_svm = plot_confusion(sl)
As expected, we are not doing so well in class 5, where there are very few samples.
This is a fun one, I promise. :)
The neural net is built with PyBrain and has just one hidden layer, sized at $\frac{1}{4}$ of the input layer. The hidden layer activation is sigmoid, the output activation is softmax (since this is a multi-class net), and there are bias units for the hidden and output layers. We use the PyBrain $\color{green}{buildNetwork()}$ function, which builds the network in one call.
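Something like the following sketch would build the architecture just described (this is not the actual $\color{green}{train\_nn}$ code, whose exact arguments may differ):

from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import SigmoidLayer, SoftmaxLayer

# one sigmoid hidden layer of 1/4 the input size, softmax output, bias units
n_in = sl.X_train_scaled.shape[1]
n_out = len(np.unique(Y_train))
net = buildNetwork(n_in, n_in / 4, n_out,
                   hiddenclass = SigmoidLayer, outclass = SoftmaxLayer, bias = True)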
NOTE: We are still using all the scaled features to train the neural net.
I am setting %matplotlib to qt so training can be watched in real time. You will see each training epoch charted: the graph on the left shows the % error, the one on the right shows the log loss.
You can play with the "test error" or "epochs" parameters to control how long training runs; we limit it to just 10 epochs for this experiment.
In [6]:
%matplotlib qt
trndata, tstdata = createDataSets(sl.X_train_scaled, Y_train, sl.X_test_scaled, Y_test)
fnn = train(trndata, tstdata, epochs = 10, test_error = 0.07, momentum = 0.15, weight_decay = 0.0001)
Finally, we train the random forest (which happens to train in seconds) wrapped in the calibration classifier (which takes two hours or so).
Random forests are very accurate, but they make over-confident predictions (or at least that is what the predict_proba function, which is supposed to return the probability of each class, gives us). So, god forbid we are ever wrong: once a probability of 0 is predicted for the correct class, log loss shoots to infinity. The calibration classifier makes predict_proba return something sane.
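A toy example of the arithmetic: sklearn clips probabilities at $10^{-15}$, so a hard zero on the true class costs about $-\ln(10^{-15}) \approx 34.5$ for that one sample, swamping everything else.

from sklearn.metrics import log_loss

# one over-confident miss dominates the average log loss
y_true = [0, 1]
confident = [[1.0, 0.0], [1.0, 0.0]]  # second sample: probability 0 on the true class
hedged    = [[0.9, 0.1], [0.6, 0.4]]  # wrong on the second sample too, but hedged
print "confident: ", log_loss(y_true, confident)  # ~17.3 (34.5 spread over 2 samples)
print "hedged:    ", log_loss(y_true, hedged)     # ~0.51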
In [8]:
sl_ccrf = SKSupervisedLearning(CalibratedClassifierCV, X, Y_train, Xt, Y_test)
sl_ccrf.train_params = \
{'base_estimator': RandomForestClassifier(**{'n_estimators' : 7500, 'max_depth' : 200}), 'cv': 10}
sl_ccrf.fit_standard_scaler()
ll_ccrf_trn, ll_ccrf_tst = sl_ccrf.fit_and_validate()
print "Calibrated log loss: ", ll_ccrf_tst
As you can see, we are simply wrapping the $\color{green}{RandomForestClassifier}$ in the $\color{green}{CalibratedClassifierCV}$. Plot the matrix (after a couple of hours):
In [10]:
%matplotlib inline
conf_ccrf = plot_confusion(sl_ccrf)
In [11]:
%matplotlib inline
x = 1. / np.arange(1., 6)
y = 1 - x
xx, yy = np.meshgrid(x, y)

lls1 = np.zeros(xx.shape[0] * yy.shape[0]).reshape(xx.shape[0], yy.shape[0])
lls2 = np.zeros(xx.shape[0] * yy.shape[0]).reshape(xx.shape[0], yy.shape[0])

# blend the SVM and RF probabilities with every pair of weights
for i, x_ in enumerate(x):
    for j, y_ in enumerate(y):
        proba = vote([sl.proba_test, sl_ccrf.proba_test], [x_, y_])
        lls1[i, j] = log_loss(Y_test, proba)

        proba = vote([sl.proba_test, sl_ccrf.proba_test], [y_, x_])
        lls2[i, j] = log_loss(Y_test, proba)

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(121)
ax1 = fig.add_subplot(122)
ax.set_aspect(1)
ax1.set_aspect(1)
res = ax.imshow(lls1, cmap=plt.cm.jet, interpolation='nearest')
res = ax1.imshow(lls2, cmap=plt.cm.jet, interpolation='nearest')
cb = fig.colorbar(res)
The graphs show the "blended" log loss: the matrix on the left blends the SVM and RF probabilities with weights "favoring" the SVM, and the one on the right "favors" the RF.
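To read the best blend off these grids, recall that lls1[i, j] was computed with weights (x[i], y[j]) and lls2[i, j] with weights (y[j], x[i]):

# locate the weight pair with the lowest blended log loss in each grid
i, j = np.unravel_index(np.argmin(lls1), lls1.shape)
print "SVM-leaning best: ", lls1[i, j], " weights (SVM, RF): ", (x[i], y[j])
i, j = np.unravel_index(np.argmin(lls2), lls2.shape)
print "RF-leaning best:  ", lls2[i, j], " weights (SVM, RF): ", (y[j], x[i])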